We propose a novel video object segmentation algorithm based on pixel-level matching using Convolutional Neural Networks (CNNs). Our network aims to distinguish the target area from the background on the basis of the pixel-level similarity between two object units. The proposed network represents a target object using features from different depth layers in order to take advantage of both the spatial details and the category-level semantic information. Furthermore, we propose a feature compression technique that drastically reduces the memory requirements while maintaining the capability of feature representation. Two-stage training (pre-training and fine-tuning) allows our network to handle any target object regardless of its category (even if the object's type does not belong to the pre-training data) or of variations in its appearance through a video sequence. Experiments on large datasets demonstrate the effectiveness of our model against related methods in terms of accuracy, speed, and stability. Finally, we show the transferability of our network to different domains, such as the infrared data domain.
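To make the core idea of pixel-level matching concrete, the following is a minimal illustrative sketch, not the authors' actual network: it labels each search-frame pixel as foreground when its feature vector is sufficiently similar (by cosine similarity) to some pixel feature of the target template. The function name, the similarity measure, and the threshold value are assumptions introduced here for illustration only.

```python
import numpy as np

def pixel_matching_mask(target_feats, search_feats, threshold=0.5):
    """Hypothetical sketch of pixel-level matching.

    target_feats: (N_target, C) per-pixel features of the target template.
    search_feats: (N_search, C) per-pixel features of the search frame.
    Returns a boolean foreground mask over the search pixels.
    """
    # L2-normalize feature vectors so dot products become cosine similarities
    t = target_feats / (np.linalg.norm(target_feats, axis=1, keepdims=True) + 1e-8)
    s = search_feats / (np.linalg.norm(search_feats, axis=1, keepdims=True) + 1e-8)
    sim = s @ t.T            # (N_search, N_target) pairwise similarity matrix
    best = sim.max(axis=1)   # best-matching target pixel for each search pixel
    return best > threshold  # foreground where the best match is strong enough

# Toy example with 2-D features: two target pixels, three search pixels
target = np.array([[1.0, 0.0], [0.0, 1.0]])
search = np.array([[0.9, 0.1], [-1.0, 0.0], [0.1, 0.9]])
mask = pixel_matching_mask(target, search)
print(mask)  # first and last search pixels match the target, the middle one does not
```

In the paper's setting, the per-pixel features would come from multiple CNN depth layers (combining spatial detail and semantics) rather than the raw 2-D vectors used in this toy example.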